Note: Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class, including choosing appropriate parameters for all graphs. Do not expect the assignment questions to spell out precisely how the graphs should be drawn. Sometimes guidance will be provided, but the absense of guidance does not mean that all choices are ok.
Read Graphical Data Analysis with R, Chap. 6, 7
Data source: https://data.ny.gov/Public-Safety/Index-Crimes-by-County-and-Agency-Beginning-1990/ca8h-8gjq
You do not need to submit the data with your assignment. You may either download and read from your local copy or read directly from the web site with df <- read_csv("https://data.ny.gov/api/views/ca8h-8gjq/rows.csv").
GGally::ggparcoord())library(tidyverse)
library(GGally)
df <- read_csv("https://data.ny.gov/api/views/ca8h-8gjq/rows.csv")
df2 <- df %>%
group_by(County, Year) %>%
summarize(Murder = sum(Murder), Rape = sum(Rape), Robbery = sum(Robbery),
Assault = sum(`Aggravated Assault`), Burglary = sum(Burglary),
Larceny = sum(Larceny), `Motor Vehicle Theft` = sum(`Motor Vehicle Theft`),
Region = max(Region)) %>%
ungroup() %>%
filter(County != "Region Total") %>%
filter(Year == 2020) %>%
arrange(desc(Region))
ggparcoord(df2, columns = 3:9, scale = "globalminmax")
GGally::ggparcoord())ggparcoord(df2, columns = 3:9, alphaLines = .1, splineFactor = 7)
We can see that there are two groups of counties - the majority of New York state counties have a similar, lower number of robberies and assaults, and the few outliers with very high values of every number of crime.
Region. Discuss outliers, clusters, and correlations in detail.# Unfortunately there is no way to reorder the counties
# see: https://github.com/timelyportfolio/parcoords/issues/16
library(parcoords)
df2 %>% select(-Year) %>%
parcoords(rownames = F
, brushMode = "1D-axes"
, reorderable = T
, queue = T
, color = list(
colorBy = "Region"
,colorScale = "scaleOrdinal"
,colorScheme = "schemeCategory10"
)
, withD3 = TRUE
)
For the most part, the non-New York City counties have a lower number of crimes, whereas all but one New York city counties have much higher crime. There is one main outlier in each region - Erie (non-New York City) has overall high crime, and Richmond (New York City) has extremely low crime. Monroe and Suffolk (non-NYC) also have a fairly high crime rate overall. Overall, there seems to be a positive correlation between all the crimes (e.g. a county with a higher number of robberies also has a higher number of rapes).
Data: SleepStudy from Lock5withR package
Draw the following graphs and answer the questions.
ClassYear and AnxietyStatus? Between ClassYear and NumEarlyClass? Justify your answers with mosaic plots.library(vcd)
library(Lock5withR)
SleepStudy$AnxietyStatus <- factor(SleepStudy$AnxietyStatus,
levels = c("normal", "moderate", "severe"))
mosaic(AnxietyStatus ~ ClassYear,
data = SleepStudy,
direction = c("v", "h"),
highlighting_fill = RColorBrewer::brewer.pal(3, "Blues"),
spacing = spacing_equal(sp = unit(0, "lines")),
labeling = labeling_border(rot_labels = c(0, 0, 0, 0),
set_varnames = c(ClassYear = "Class Year", AnxietyStatus = "Anxiety\nStatus")))
mosaic(NumEarlyClass ~ ClassYear,
data = SleepStudy,
direction = c("v", "h"),
highlighting_fill = RColorBrewer::brewer.pal(6, "Greens"),
spacing = spacing_equal(sp = unit(0, "lines")),
labeling = labeling_border(rot_labels = c(0, 0, 0, 0),
set_varnames = c(ClassYear = "Class Year", NumEarlyClass = "# of Early Classes")))
For Class years 1 and 2, there is no association between class year and anxiety status. There is a slight assocation for class years 3 and 4, where class year 4 has more moderately anxious people, but overall there is not a significant assocation.
There is a clear assocation between class year and number of early classes. As class year increases, the number of students who take early classes decreases.
chisq.test(SleepStudy$ClassYear, SleepStudy$AnxietyStatus)
##
## Pearson's Chi-squared test
##
## data: SleepStudy$ClassYear and SleepStudy$AnxietyStatus
## X-squared = 4.6165, df = 6, p-value = 0.5938
chisq.test(SleepStudy$ClassYear, SleepStudy$NumEarlyClass)
##
## Pearson's Chi-squared test
##
## data: SleepStudy$ClassYear and SleepStudy$NumEarlyClass
## X-squared = 54.952, df = 15, p-value = 1.819e-06
The p-value for the assocation between class year and anxiety status is 0.59, which means we fail to reject the null hypothesis that there is not an assocation between the two variables. This agrees with our analysis in part a, where the mosaic plots visually showed no assocation. The p-value for the assocation between class year and number of early classes is essentially 0, therefore we reject the null hypothesis that there is no assocation. This also agrees with our visual analysis with the mosaic plots, where we saw a negative correlation between number of early classes and class year.
mosaic(AnxietyStatus ~ NumEarlyClass+ClassYear,
data = SleepStudy,
direction = c("v", "v", "h"),
highlighting_fill = RColorBrewer::brewer.pal(3, "Blues"),
spacing = vcd::spacing_dimequal(c(.3, 0, 0)),
labeling = labeling_border(rot_labels = c(0, 0, 0, 0),
set_varnames = c(ClassYear = "Class Year",
AnxietyStatus = "Anxiety\nStatus",
NumEarlyClass = "# of Early Classes")))
Since some of the spines are very narrow, it is helpful to consider a same bin width mosaic plot, which is essentionally a faceted stacked bar chart based on proportions:
sleep <- SleepStudy %>%
group_by(ClassYear, NumEarlyClass, AnxietyStatus) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
sleep %>%
ggplot(aes(x = ClassYear, y = prop, fill = AnxietyStatus)) +
geom_col() +
facet_wrap(~NumEarlyClass) +
scale_fill_manual(values = RColorBrewer::brewer.pal(3, "Blues")) +
ggtitle("Anxiety Status by Class Year", sub = "faceted on number of early classes") +
theme_classic()
The results are perhaps counterintuitive: there is a clear pattern of severe anxiety in students who take 0 or 2 early classes. (One might have expected more anxiety with a higher number of early classes.) We need to be cautious about this conclusion since the sample sizes for 1, 3, 4, and 5 early classes are so small:
SleepStudy %>% group_by(NumEarlyClass) %>% count
## # A tibble: 6 × 2
## # Groups: NumEarlyClass [6]
## NumEarlyClass n
## <int> <int>
## 1 0 85
## 2 1 14
## 3 2 88
## 4 3 35
## 5 4 11
## 6 5 20
pairs() function to draw a mosaic pairs plot of all categorical (factor) variables in SleepStudy. (Note: The vcd package must be loaded for pairs() to find the correct method.) Name a pair of variables which appear to have a very strong association. Name a pair of variables which appear not to be associated.# fig.width=8 and fig.height=8 are included in chunk options
Lock5withR::SleepStudy %>%
dplyr::select(c(where(is.factor), allNighter, NumEarlyClass, ClassYear)) %>%
table() %>%
pairs(space = .2,
lower_panel = pairs_mosaic(highlighting = 2, spacing = spacing_equal(0)),
upper_panel = pairs_mosaic(spacing = spacing_equal(0)),
diag_panel = pairs_barplot(
gp_vartext = gpar(fontsize = 12),
gp_leveltext = gpar(fontsize = 8),
abbreviate = 2))
some examples of strong assocation variables: (depression status, anxiety status), (sex, class year), (depression status, sex) some examples of weak/no assocition variables: (sex, early class), (lark owl, depression status), (alcohol use, early class)
The file stats_wl.csv contains information about waitlist movement for a Fall 2021 Columbia U undergraduate statistics class.
There are 640 rows and 4 variables:
Name name of student (actual names were replaced with names generated from the randomNames package)
Date since SSOL updates overnight, waitlist positions were collected each morning during the change of program period
Priority position in waitlist, for example 1 = top position on list
Status final outcome, Registered = received a place in class and remained; Dropped Class = received a place in class and left; Left List = left waiting list; Joined = remained on waiting list at the end of the change of program period. (Note that the status reflects what ultimately happened, not what the status was on a particular date.)
Create an alluvial diagram that shows waitlist movement during the change of program period. It is not necessary to include the Name column in the diagram, but it should be possible to observe movement of individual students: for example, that the student who was 22nd in the waitlist on Sept 9th moved up to 15th place on Sept 16th and then left the list.
library(tidyverse)
library(ggalluvial)
df <- read_csv("stats_wl.csv") %>%
mutate(Status = fct_relevel(Status, "Registered", "Dropped Class", "Left List", "Joined"))
ggplot(df, aes(x = Date, y = 1, stratum = fct_rev(factor(Priority)), alluvium = Name)) +
geom_alluvium(aes(fill = Status), color = "grey50") +
geom_stratum(aes(fill = Status), color = "grey50") +
geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 2, color = "grey40") +
scale_y_reverse(breaks = NULL) +
xlab("") +
ylab("Position in waitlist") +
theme_bw() +
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())
# Alternative solution
g <- list()
for(i in 1:4) {
g[[i]] <- ggplot(df, aes(x = Date, y = 1, stratum = fct_rev(factor(Priority)), alluvium = Name)) +
geom_alluvium(aes(fill = Status, alpha = Status), color = "grey50") +
scale_alpha_manual(values = c(rep(.2, i-1), 1, rep(.2, 4-i))) +
scale_y_reverse(breaks = seq(1, 56, 5) -.5, labels = seq(1, 56, 5)) +
scale_x_date(date_breaks = "1 day", date_labels = "%d") +
ggtitle(levels(df$Status)[i]) +
xlab("Day (September 2021)") +
ylab("Position in waitlist") +
guides(fill = "none", alpha = "none") +
theme_bw() +
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())
}
gridExtra::grid.arrange(g[[1]], g[[2]], g[[3]], g[[4]], ncol = 2)